BUDT 758X Project Proposal: Career Analysis

Group 10 (Group One ): Jiading Chen, Changnan Jing, Yuanhao Zhang, Shlok Jethwa

Introduction

As current students of the Information Systems program, and with diverse backgrounds, the main problem that a student faces is to decide in which direction should he/she forward his/her career. This is due to the reason that many of us have done an undergraduate in business, the software side, the electrical side, etc. Therefore this project will help them to anticipate what kind of work pays the most amount of money (P.S Money is very important :-P). This project will show the different kinds of salaries by college attended, program majored and even the region of living. In a nutshell, a student can decide their career plan based on our project.

Questions of Interest

What are the benefits of studying in the Ivy League as compared to other universities?

Which region has the best starting salary in the U.S at graduation and mid-career?

What are the salary figures of students from different universities ten years down the line?

What are the school rankings based on salaries now and even ten years after graduation?

What are the majors that offer the most salary increase in the coming years after graduation?

What are the significant factors that affect mid-career salary and how significant are they?

Limitation of the project

The current data we used is from Wall Street Journal, which is considered as a reliable source in the industry. There are couple constraint of the data that we want to inform readers of this project. First of all, the current dataset doesn't contain all the schools and majors in the U.S. Also, the data are collected several years ago so they may not be representative for the current situation. Therefore, all the conclusions we got in this project are based on the dataset.

In this project, we are not aimming to find the exact number of salary for each school, region or major but we would like to investigate the correlation between all these factors including starting salary, SAT score, mjors, school types and school regions. Moreover, we want to prove that we can forecast the mid-career salary with all these factors.

1. Tidying Data

Read csv file we need and install basic packages

Data source: https://www.kaggle.com/wsj/college-salaries#salaries-by-college-type.csv

Package Installation. Major packages we used in this project is pandas and matplotlib.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import sklearn
import seaborn as sns
import statsmodels.api as sm
import plotly.graph_objects as go
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
import requests
geolocator = Nominatim(user_agent="specify_your_app_name_here")
import socket
s = socket.socket()
s.settimeout(1000)

We first read data in pandas dataframe and then cast string data from object to string and numeric data to float for further analysis and modification.

In [9]:
#User header to access the website 
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}
poke_url = 'https://github.com/chshaw/SAT-ACT-Scores/blob/master/data/sat_2017.csv'
r = requests.get(poke_url, headers=header)
#Convert html into a dataframe
dfs = pd.read_html(r.text)
df = dfs[0].sort_values(by='State', ascending=True)
df = df.drop('Unnamed: 0',axis=1)
#Read state to get abrreviation for state 
#state = pd.read_csv('states.csv')
#df['State'] = state['Abbreviation']
df.head()
Out[9]:
State Participation Evidence-Based Reading and Writing Math Total
0 Alabama 5% 593 572 1165
1 Alaska 38% 547 533 1080
2 Arizona 30% 563 553 1116
3 Arkansas 3% 614 594 1208
4 California 53% 531 524 1055
In [11]:
college_type = pd.read_csv('salaries-by-college-type.csv')
college_type['School Type'] = college_type['School Type'].astype(str)
dummies = pd.get_dummies(college_type['School Type'])
college_type = pd.concat([college_type, dummies], axis = 1)
college_type = college_type.dropna()
college_type.head()
Out[11]:
School Name School Type Starting Median Salary Mid-Career Median Salary Mid-Career 10th Percentile Salary Mid-Career 25th Percentile Salary Mid-Career 75th Percentile Salary Mid-Career 90th Percentile Salary Engineering Ivy League Liberal Arts Party State
0 Massachusetts Institute of Technology (MIT) Engineering $72,200.00 $126,000.00 $76,800.00 $99,200.00 $168,000.00 $220,000.00 1 0 0 0 0
3 Polytechnic University of New York, Brooklyn Engineering $62,400.00 $114,000.00 $66,800.00 $94,300.00 $143,000.00 $190,000.00 1 0 0 0 0
5 Worcester Polytechnic Institute (WPI) Engineering $61,000.00 $114,000.00 $80,000.00 $91,200.00 $137,000.00 $180,000.00 1 0 0 0 0
6 Carnegie Mellon University (CMU) Engineering $61,800.00 $111,000.00 $63,300.00 $80,100.00 $150,000.00 $209,000.00 1 0 0 0 0
7 Rensselaer Polytechnic Institute (RPI) Engineering $61,100.00 $110,000.00 $71,600.00 $85,500.00 $140,000.00 $182,000.00 1 0 0 0 0

Clean data

After reading in data, we need to do a further cleaning to make our data more readable and manipulatable.

In [12]:
cols = ['Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 25th Percentile Salary',
       'Mid-Career 75th Percentile Salary',
       'Mid-Career 90th Percentile Salary']
# Filter out unnecessary symbols for analysis
for col in cols:
    college_type[col] = college_type[col].str.replace('$','')
    college_type[col] = college_type[col].str.replace(',','')
    
#Cast string to float
for j in range(2,8):
    for i in range(len(college_type)):
        college_type.iloc[i,j] = float(college_type.iloc[i,j])
        
#Clean and categorize data
college_type_engineering = college_type[college_type['Engineering'] == 1].iloc[:,0:-5]
college_type_IL = college_type[college_type['Ivy League'] == 1].iloc[:,0:-5]
college_type_LA = college_type[college_type['Liberal Arts'] == 1].iloc[:,0:-5]
college_type_state = college_type[college_type['State'] == 1].iloc[:,0:-5]
college_type_party = college_type[college_type['Party'] == 1].iloc[:,0:-5]

2. Exploratary Data Analysis and Data Visualization

2.1 Analysis about types and regions of school

First, we are going to start our analysis with different types of universities. We choose to use mean value of data to represent the average level of categories.

In [13]:
college_type_engineering_mean = college_type_engineering.mean()
college_type_IL_mean = college_type_IL.mean()
college_type_LA_mean = college_type_LA.mean()
college_type_state_mean = college_type_state.mean()
college_type_party_mean = college_type_party.mean()
college_type_salary_mean = [college_type_engineering_mean, college_type_IL_mean, college_type_LA_mean, college_type_state_mean, college_type_party_mean]
college_type_salary = pd.DataFrame(college_type_salary_mean).T
college_type_salary.columns = list(college_type['School Type'].unique())
college_type_salary
Out[13]:
Engineering Party Liberal Arts Ivy League State
Starting Median Salary 57440.000000 60475.0 46171.428571 44126.285714 45878.947368
Mid-Career Median Salary 101373.333333 120125.0 89635.714286 78567.428571 84742.105263
Mid-Career 10th Percentile Salary 61793.333333 57900.0 47478.571429 41886.285714 44052.631579
Mid-Career 25th Percentile Salary 79913.333333 82787.5 63235.714286 56689.714286 60315.789474
Mid-Career 75th Percentile Salary 130433.333333 184125.0 127828.571429 106970.857143 117842.105263
Mid-Career 90th Percentile Salary 173333.333333 269625.0 191142.857143 147571.428571 166947.368421
In [14]:
#draw a barplot about the salary of different school type
college_type_salary.plot(kind = 'bar',figsize = (20,10))
plt.ylabel('Salary', fontsize = 25)
plt.xticks(rotation = 'vertical',fontsize = 17)
plt.yticks(fontsize = 20)
plt.title('Salary of different school type', fontsize = 25)
plt.legend(fontsize = 20)
plt.show()

From the bar polt, we are suprised to see that party schools have the highest salary in most periods of time. What's more, the line chart also tells us that party schools have a more rapid increasing speed of salary, which is contrary to what we think normally.

We continue to analyze salary distribution according to different regions in the U.S.

In [15]:
#prepare the data for radar plot
c_t = college_type_salary.T
c_t_eng = c_t.loc['Engineering']
c_t_party = c_t.loc['Party']
c_t_LA = c_t.loc['Liberal Arts']
c_t_IL = c_t.loc['Ivy League']
c_t_S = c_t.loc['State']
In [16]:
c_t_S
Out[16]:
Starting Median Salary                45878.947368
Mid-Career Median Salary              84742.105263
Mid-Career 10th Percentile Salary     44052.631579
Mid-Career 25th Percentile Salary     60315.789474
Mid-Career 75th Percentile Salary    117842.105263
Mid-Career 90th Percentile Salary    166947.368421
Name: State, dtype: float64
In [54]:
!!pip install pygal
Out[54]:
['Requirement already satisfied: pygal in c:\\programdata\\anaconda3\\lib\\site-packages (2.4.0)']
In [55]:
#draw a barplot about the salary of different school types
import pygal
#call radar and set the parameters
radar_chart = pygal.Radar(fill = True, range=(10000,300000))
radar_chart.title = 'Salary By School Type'
#set label for each angle
radar_chart.x_labels = ['Starting','Mid-Career 10th','Mid-Career 25th','Mid-Career','Mid-Career 75th','Mid-Career 90th']

#draw the lines
radar_chart.add('Engineer', c_t_eng)
radar_chart.add('Party', c_t_party)
radar_chart.add('Liberal Arts', c_t_LA)
radar_chart.add('Ivy League', c_t_IL)
radar_chart.add('State', c_t_S)
radar_chart.render_to_file('radar_chart.svg')
In [56]:
# dsiplay the radar plot
from IPython.display import SVG, display
def show_svg():
    display(SVG('radar_chart.svg'))
    
In [57]:
show_svg()
Salary By School Type2800002800002400002400002000002000001600001600001200001200008000080000400004000000Starting1.570796327Mid-Career 10th2.617993878Mid-Career 25th3.665191429Mid-Career4.71238898Mid-Career 75th5.759586532Mid-Career 90th6.806784083Salary By School Type57440295.6211.438024691358Starting101373.3333215.50345586608876216.79483539094645Mid-Career 10th61793.33333246.7761891636707281.50754115226334Mid-Career 25th79913.33333295.6320.3880452674897Mid-Career130433.3333398.6572725226867308.7305041152264Mid-Career 75th173333.3333432.5531860766602188.25514403292192Mid-Career 90th60475295.6209.0306327160494Starting120125200.68749147619928209.35783179012338Mid-Career 10th57900249.8523684201618279.9634259259259Mid-Career 25th82787.5295.6322.66786265432097Mid-Career184125441.0798387674906330.02488425925935Mid-Career 75th269625508.6346315149202150.0653935185187Mid-Career 90th46171.42857295.6220.3763668430335Starting89635.71429224.77751928258215221.45003306878306Mid-Career 10th47478.57143258.08649061051267275.8302358906525Mid-Career 25th63235.71429295.6307.15919312169314Mid-Career127828.5714396.5992122813699307.697442680776Mid-Career 75th191142.8571446.62474970102033181.19179894179905Mid-Career 90th44126.28571295.6221.99859435626104Starting78567.42857233.5227327036103225.8397698412698Mid-Career 10th41886.28571262.50503683756506273.6123077601411Mid-Career 25th56689.71429295.6301.96684126984127Mid-Career106970.8571380.11922905614733299.4251701940036Mid-Career 75th147571.4286412.1983306735083198.47244268077606Mid-Career 90th45878.94737295.6220.60836582196225Starting84742.10526228.64403090019323223.3908625730994Mid-Career 10th44052.63158260.7933754819339274.47149122807014Mid-Career 25th60315.78947295.6304.8430799220273Mid-Career117842.1053388.7087602102152303.73676088369075Mid-Career 75th166947.3684427.507542379099190.78784925276165Mid-Career 90thEngineerPartyLiberal ArtsIvy LeagueState

From the radar plot, we can see that
1.A funny one: The students in party school have the highest salaries in almost every stage and the students in Ivy league have the lowest ones.
2.The salary difference in party school is largest, and the difference in Ivy League is smallest.
We think the reason why we get that conclusion is the admission criteria and level of majors in party schools varies greatly, so the salary after they graduate varies greatly, and the high salary level students will raise the mean salary. On the contrary, the Ivy League school has close level students and majors, so their salaries are close and more reasonable.

In [22]:
college_region = pd.read_csv('salaries-by-region.csv')
college_region['Region'] = college_region['Region'].astype(str)
dummies = pd.get_dummies(college_region['Region'])
college_region = pd.concat([college_region, dummies], axis = 1)
#college_type.drop('Party', axis = 1, inplace = True)
college_region = college_region.dropna()
college_region.head()
Out[22]:
School Name Region Starting Median Salary Mid-Career Median Salary Mid-Career 10th Percentile Salary Mid-Career 25th Percentile Salary Mid-Career 75th Percentile Salary Mid-Career 90th Percentile Salary California Midwestern Northeastern Southern Western
0 Stanford University California $70,400.00 $129,000.00 $68,400.00 $93,100.00 $184,000.00 $257,000.00 1 0 0 0 0
3 University of California, Berkeley California $59,900.00 $112,000.00 $59,500.00 $81,000.00 $149,000.00 $201,000.00 1 0 0 0 0
5 Cal Poly San Luis Obispo California $57,200.00 $101,000.00 $55,000.00 $74,700.00 $133,000.00 $178,000.00 1 0 0 0 0
6 University of California at Los Angeles (UCLA) California $52,600.00 $101,000.00 $51,300.00 $72,500.00 $139,000.00 $193,000.00 1 0 0 0 0
7 University of California, San Diego (UCSD) California $51,100.00 $101,000.00 $51,700.00 $75,400.00 $131,000.00 $177,000.00 1 0 0 0 0
In [23]:
for col in cols:
    college_region[col] = college_region[col].str.replace('$','')
    college_region[col] = college_region[col].str.replace(',','')
for j in range(2,8):
    for i in range(len(college_region)):
        college_region.iloc[i,j] = float(college_region.iloc[i,j])

college_type_engineering_mean = college_type_engineering.mean()
college_type_IL_mean = college_type_IL.mean()
college_type_LA_mean = college_type_LA.mean()
college_type_state_mean = college_type_state.mean()
college_type_party_mean = college_type_party.mean()
college_type_salary_mean = [college_type_engineering_mean, college_type_IL_mean, college_type_LA_mean, college_type_state_mean, college_type_party_mean]
college_type_salary = pd.DataFrame(college_type_salary_mean).T
college_type_salary.columns = list(college_type['School Type'].unique())
college_type_salary
Out[23]:
Engineering Party Liberal Arts Ivy League State
Starting Median Salary 57440.000000 60475.0 46171.428571 44126.285714 45878.947368
Mid-Career Median Salary 101373.333333 120125.0 89635.714286 78567.428571 84742.105263
Mid-Career 10th Percentile Salary 61793.333333 57900.0 47478.571429 41886.285714 44052.631579
Mid-Career 25th Percentile Salary 79913.333333 82787.5 63235.714286 56689.714286 60315.789474
Mid-Career 75th Percentile Salary 130433.333333 184125.0 127828.571429 106970.857143 117842.105263
Mid-Career 90th Percentile Salary 173333.333333 269625.0 191142.857143 147571.428571 166947.368421
In [24]:
college_region_CA = college_region[college_region['California'] == 1].iloc[:,0:-5]
college_region_W = college_region[college_region['Western'] == 1].iloc[:,0:-5]
college_region_MW = college_region[college_region['Midwestern'] == 1].iloc[:,0:-5]
college_region_S = college_region[college_region['Southern'] == 1].iloc[:,0:-5]
college_region_NE = college_region[college_region['Northeastern'] == 1].iloc[:,0:-5]
college_region_CA_mean = college_region_CA.mean()
college_region_W_mean = college_region_W.mean()
college_region_MW_mean = college_region_MW.mean()
college_region_S_mean = college_region_S.mean()
college_region_NE_mean = college_region_NE.mean()
college_region_salary_mean = [college_region_CA_mean, college_region_W_mean,college_region_MW_mean,college_region_S_mean,college_region_NE_mean]
college_region_salary = pd.DataFrame(college_region_salary_mean).T
college_region_salary.columns = list(college_region['Region'].unique())
college_region_salary
Out[24]:
California Western Midwestern Southern Northeastern
Starting Median Salary 50072.727273 44932.352941 44460.9375 44946.478873 48679.268293
Mid-Career Median Salary 91718.181818 79541.176471 78178.1250 80018.309859 90929.268293
Mid-Career 10th Percentile Salary 47777.272727 42985.294118 43076.5625 43074.647887 49101.219512
Mid-Career 25th Percentile Salary 66686.363636 57741.176471 57348.4375 58008.450704 65715.853659
Mid-Career 75th Percentile Salary 123618.181818 106891.176471 106864.0625 110022.535211 127606.097561
Mid-Career 90th Percentile Salary 167909.090909 143823.529412 147689.0625 152769.014085 181926.829268

Tidying data for the following analysis

Because latitide and longitude of schools are needed in the following analysis, we decide to extract them at first. As a result, we can improve the analysis corresponsding to regions further.

In [25]:
# Merge data together
college_type2 = college_type.drop(['Engineering', 'Ivy League', 'Liberal Arts', 'Party', 'State'], axis = 1)
college_region2 = college_region.drop(['California', 'Midwestern','Northeastern', 'Southern', 'Western'], axis = 1)
college_merge = pd.merge(left = college_type2, right = college_region2, left_on = 'School Name', right_on = 'School Name')
college_merge = college_merge.dropna()
college_merge2 = college_merge.iloc[:, 0:9]

#There are some location can not be searched by geopy for some reason. 
#The following modification are tested in google map
#Manually changed some addresses to other format
college_merge2['Name'] = college_merge2['School Name'].str.replace(r"\(.*\)","")
college_merge2['Name'] = college_merge2['Name'].str.replace("State University of New York ","SUNY")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Alaska, Anchorage","University of Alaska Anchorage")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Nevada, Las Vegas","4505 S Maryland Pkwy, Las Vegas, NV 89154")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Texas at El Paso","500 W University Ave, El Paso, TX 79968")
college_merge2['Name'] = college_merge2['Name'].str.replace("California State University, Dominguez Hills","California State University Dominguez Hills")
college_merge2['Name'] = college_merge2['Name'].str.replace("California State University, Sacramento","6000 J Street, Sacramento, CA 95819")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Farmingdale","Farmingdale State College")
college_merge2['Name'] = college_merge2['Name'].str.replace("State University of New York ","SUNY")
college_merge2['Name'] = college_merge2['Name'].str.replace("Bowling Green State University","Bowling Green, OH 43403")
college_merge2['Name'] = college_merge2['Name'].str.replace("Georgia Institute of Technology","North Ave NW, Atlanta, GA 30332")
college_merge2['Name'] = college_merge2['Name'].str.replace("Polytechnic University of New York, Brooklyn","Brooklyn, NY 11201")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Maryland, College Park","University of Maryland-College Park")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Potsdam","State University of New York at Potsdam")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Oneonta","108 Ravine Pkwy, Oneonta, NY 13820")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Geneseo","State University of New York at Geneseo")
college_merge2['Name'] = college_merge2['Name'].str.replace("California State University, Fullerton","800 N State College Blvd, Fullerton, CA 92831")
college_merge2['Name'] = college_merge2['Name'].str.replace("Texas A&M University","400 Bizzell St, College Station, TX 77843")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Alabama at Huntsville","301 Sparkman Dr NW, Huntsville, AL 35899")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of California at Los Angeles","California, Los Angeles, 90095")

#Temporary lists to store information of latitude and longitude 
latitude = []
longitude = []
states =[]
counter = 0
#This is an access to search location
geolocator = Nominatim(user_agent="specify_your_app_name_here")
#Iterate through each rows
for i, x in college_merge2.iterrows():
    #Using try except to extend socket time and prevent timeout error
    try:
        #Using package geopy to get longitude and latitude
        location = geolocator.geocode(x['Name'])
        latitude.append(location.latitude)
        longitude.append(location.longitude)
        state = location.address.split(',')
        states.append(state[-3])

    except:
        location = geolocator.geocode(x['Name'])
        latitude.append(location.latitude)
        longitude.append(location.longitude)
        state = location.address.split(',')
        states.append(state[-3])

college_merge2['Latitude'] = latitude
college_merge2['Longitude'] = longitude
college_merge2['State'] = states
In [26]:
#Take the difference of students in different locations into consideration
df.State = df.State.str.replace(' ','') # Standardize format of locations
college_merge2.State = college_merge2.State.str.replace(' ','')
college_merge_final = df.merge(college_merge2, left_on = 'State', right_on = 'State')
college_merge_final = college_merge_final.dropna()
In [27]:
# Manually pick the first 8 states among all levels
college_state = college_merge_final.iloc[:,[0,5,7,8,9,10,11,12]]
In [28]:
st = ['NewJersey', 'Massachusetts', 'Connecticut','Pennsylvania', 'Maryland','RhodeIsland','NewHampshire','SouthDakota','Colorado']
temp_state = college_state.groupby(by = 'State').mean()
college_state_salary = pd.DataFrame()
for i in st:
    college_state_salary[i] = temp_state.loc[i,:].T
college_state_salary.index = college_state_salary.index.str.replace('_x','')
college_state_salary.plot(kind = 'barh', figsize = (25,20))
plt.title('Salary by States', fontsize = 30 )
plt.xlabel('Salary', fontsize = 30)
plt.yticks(fontsize = 25)
plt.xticks(fontsize = 25)
plt.legend(fontsize = 20)
Out[28]:
<matplotlib.legend.Legend at 0x19b66e1ba90>

Hence from the graphs above we can see that california and northeastern regions have the best starting and mid career salary in the U.S., which is close to our impression to those areas. To be more specific, the first 8 states that have more salary assemble in the northeast, which indicates a clearer conclusion for graduates.

2.2 Further analysis about regions and school types

After seperate analysis about regions and types of universities, we are curious about the comprehensive effect of the two factors, so we are going to combine them to make a deeper analysis.

Create a new variable 'Increasing rate' to reveal the potential career developing ability.

In [29]:
# Give priority to types of schools
college_merge2['Increasing rate'] = college_merge2['Mid-Career Median Salary_x']/college_merge2['Starting Median Salary_x']
college_merge3 = college_merge2.set_index(['School Type', 'State'])
college_merge3 = college_merge3.sort_values(by = 'Increasing rate', ascending = False)
college_merge3.head(10)
Out[29]:
School Name Starting Median Salary_x Mid-Career Median Salary_x Mid-Career 10th Percentile Salary_x Mid-Career 25th Percentile Salary_x Mid-Career 75th Percentile Salary_x Mid-Career 90th Percentile Salary_x Region Name Latitude Longitude Increasing rate
School Type State
Ivy League NewHampshire Dartmouth College 58000 134000 63100 90200 234000 321000 Northeastern Dartmouth College 43.705244 -72.290530 2.310345
Liberal Arts Indiana DePauw University 41400 88300 49500 57400 133000 185000 Midwestern DePauw University 39.639768 -86.861676 2.132850
Ivy League Connecticut Yale University 59100 126000 58000 80600 198000 326000 Northeastern Yale University 41.257131 -72.989670 2.131980
Liberal Arts Connecticut Wesleyan University (Middletown, Connecticut) 46500 97900 42000 62500 126000 215000 Northeastern Wesleyan University 41.555903 -72.655988 2.105376
State Colorado University of Colorado - Boulder (UCB) 47100 97600 51600 69000 128000 187000 Western University of Colorado - Boulder 40.006878 -105.272776 2.072187
NewYork State University of New York (SUNY) at Albany 44500 92200 47000 63100 135000 209000 Northeastern SUNY at Albany 42.686431 -73.815715 2.071910
Party NewYork State University of New York (SUNY) at Albany 44500 92200 47000 63100 135000 209000 Northeastern SUNY at Albany 42.686431 -73.815715 2.071910
State OswegoCounty State University of New York (SUNY) at Oswego 38000 77800 40400 53000 115000 169000 Northeastern SUNY at Oswego 43.452013 -76.541845 2.047368
Liberal Arts NewYork Colgate University 52800 108000 60000 76700 167000 265000 Northeastern Colgate University 42.821071 -75.536975 2.045455
State NewYork State University of New York (SUNY) at Oneonta 37500 76700 40000 54300 97700 155000 Northeastern 108 Ravine Pkwy, Oneonta, NY 13820 42.465682 -75.061711 2.045333
In [30]:
# Give priority to school regions
college_merge4 = college_merge2.set_index(['State','School Type', ])
college_merge4 = college_merge4.sort_values(by = 'Increasing rate', ascending = False)
college_merge4.head(10)
Out[30]:
School Name Starting Median Salary_x Mid-Career Median Salary_x Mid-Career 10th Percentile Salary_x Mid-Career 25th Percentile Salary_x Mid-Career 75th Percentile Salary_x Mid-Career 90th Percentile Salary_x Region Name Latitude Longitude Increasing rate
State School Type
NewHampshire Ivy League Dartmouth College 58000 134000 63100 90200 234000 321000 Northeastern Dartmouth College 43.705244 -72.290530 2.310345
Indiana Liberal Arts DePauw University 41400 88300 49500 57400 133000 185000 Midwestern DePauw University 39.639768 -86.861676 2.132850
Connecticut Ivy League Yale University 59100 126000 58000 80600 198000 326000 Northeastern Yale University 41.257131 -72.989670 2.131980
Liberal Arts Wesleyan University (Middletown, Connecticut) 46500 97900 42000 62500 126000 215000 Northeastern Wesleyan University 41.555903 -72.655988 2.105376
Colorado State University of Colorado - Boulder (UCB) 47100 97600 51600 69000 128000 187000 Western University of Colorado - Boulder 40.006878 -105.272776 2.072187
NewYork State State University of New York (SUNY) at Albany 44500 92200 47000 63100 135000 209000 Northeastern SUNY at Albany 42.686431 -73.815715 2.071910
Party State University of New York (SUNY) at Albany 44500 92200 47000 63100 135000 209000 Northeastern SUNY at Albany 42.686431 -73.815715 2.071910
OswegoCounty State State University of New York (SUNY) at Oswego 38000 77800 40400 53000 115000 169000 Northeastern SUNY at Oswego 43.452013 -76.541845 2.047368
NewYork Liberal Arts Colgate University 52800 108000 60000 76700 167000 265000 Northeastern Colgate University 42.821071 -75.536975 2.045455
State State University of New York (SUNY) at Oneonta 37500 76700 40000 54300 97700 155000 Northeastern 108 Ravine Pkwy, Oneonta, NY 13820 42.465682 -75.061711 2.045333

From the perspective of school type, although party schools have the amazingly highest mid-career salary, Ivy League and Liberal Arts schools have the highest salary increasing rate. From the other perspective, schools in Indiana, New Hampshine and Connecticut own the fasteast increasing rate, who are almost in the northeastern region.

As a result, for students who would like to get a fast growth in their career, they would better to choose Ivy League and Liberal Arts schools in Indiana, New Hampshine and Connecticut. On the other hand, the result also answer our question that Ivy League owns a better location which has a better academic atmosphere. This is the benefit to study there.

2.3 Analysis about majors

Since major-choosing is more subjective and isolated from regions and type of schools, we decide to analyze major seperately.

In [32]:
college_major = pd.read_csv('degrees-that-pay-back.csv')

#Since NaN doesn't have function replace, we convert it to 0 for now
for col in cols:
    college_major[col] = college_major[col].str.replace('$','')
    college_major[col] = college_major[col].str.replace(',','')
tem = ['Starting Median Salary','Mid-Career Median Salary','Mid-Career 10th Percentile Salary','Mid-Career 25th Percentile Salary','Mid-Career 75th Percentile Salary','Mid-Career 90th Percentile Salary']
college_major[tem] = college_major[tem].astype(float)
tem =['Starting Median Salary','Mid-Career Median Salary','Mid-Career 75th Percentile Salary']
college_major1 = college_major.sort_values(by=tem, ascending=False)[0:15]
college_major1 = college_major1.reset_index()
In [34]:
from matplotlib import rc

# y-axis in bold
rc('font', weight='bold')
 
# Values of each group
bars1 = college_major1['Mid-Career 90th Percentile Salary']
bars2 = college_major1['Mid-Career Median Salary']
bars3 = college_major1['Starting Median Salary']
 
# Heights of bars1 + bars2
bars = np.add(bars1, bars2).tolist()
 
# The position of the bars on the x-axis
r = [x for x in range(0,15)]
 
# Names of group and bar width
names = college_major1['Undergraduate Major']
barWidth = 1
 
# Create brown bars
plt.bar(r, bars1, color='#7f6d5f', edgecolor='white', width=barWidth, label='Mid-Career 90th Percentile Salary')
# Create green bars (middle), on top of the firs ones
plt.bar(r, bars2, bottom=bars1, color='#557f2d', edgecolor='white', width=barWidth, label='Mid-Career Median Salary')
# Create green bars (top)
plt.bar(r, bars3, bottom=bars, color='#2d7f5e', edgecolor='white', width=barWidth, tick_label='Starting Median Salary', label='Starting Median Salary')
# Custom X axis
plt.title('Salary for Different Stages of Top 15 Majors', fontsize = 15)
plt.xticks(r, names,rotation='vertical',fontsize = 15)
plt.xlabel(['High to low: Mid-Career 90th Percentile Salary','Mid-Career Median Salary','Starting Median Salary'])
plt.yticks(fontsize = 15)
plt.rcParams['figure.figsize'] = (9.0, 8.0)
# Show graphic
plt.legend(fontsize = 10)
Out[34]:
<matplotlib.legend.Legend at 0x19b675dceb8>

According to the plot, we can clearly see the top 15 majors with the highest salary in the U.S. In average, students studying the chemical engineering earns the highest salary. Also, we are glad to see Information Systems is one of the top 15 majors with high salary.

Plot the result of rankings of different majors

In [35]:
pip install squarify
The following command must be run outside of the IPython shell:

    $ pip install squarify

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more informations on how to install packages:

    https://docs.python.org/3/installing/
In [37]:
import squarify
college_major2 = college_major.sort_values('Percent change from Starting to Mid-Career Salary', ascending=False)[0:15]

height = college_major2['Undergraduate Major']
bars = college_major2['Percent change from Starting to Mid-Career Salary']
y_pos = np.arange(len(bars))
 
# Create horizontal bars
plt.barh(y_pos,bars)
 
# Create names on the y-axis
plt.title('Percent change from Starting to Mid-Career Salary of Top 15 Majors', fontsize = 20 )
plt.yticks(y_pos, height, fontsize = 15)
plt.ylabel("Major",fontsize = 15)
plt.xlabel('Precent Change Rate',fontsize = 15)
plt.rcParams['figure.figsize'] = (10.0, 10.0)
# Show graphic
plt.show()

From this graph we can see that Math, Philosophy, International Relations, Economics and Marketing are the top 5 undergraduate majors that offer the most salary increase down the line. And Majors like nursing and physician assistant are among the ones which do not see much salary hike in the coming years. Also, we are sad to notice that IS is not in the top list of increasing rate from starting to mid-career.

Hence from the above two graphs you can see that physician assistant has the hugest starting median salary in all branches but when talking about percent change from starting to mid career salary then it is the least. So even if is the most in the beginning one cannot expect much increment after that.

Among all the school types we can see that Ivy league students get higher salary on graduating and even after 10 years. And so there are benfits of studying in Ivy league as compared to the other category of colleges.

2.4 Visualization using mapbox

For guiding graduates more directly, we use mapbox to create a 'salary map' as a guide to show them which regions have a better developing chance.

In [38]:
# Cast Mid-career median salary into string temperarily to make it consistant with school name to be the label in our map
college_merge4['Mid-Career Median Salary_x_1'] = college_merge4['Mid-Career Median Salary_x'].astype(str)
mapbox_access_token = 'pk.eyJ1IjoiamNuMTk5NzI5IiwiYSI6ImNrMm9xNzdiOTE2OXYzY205OG52ZmxhZGQifQ.O2FQnsWlbbYwKPtU0rBTlQ'
college_map_data = go.Scattermapbox(
        lon = college_merge4['Longitude'],
        lat = college_merge4['Latitude'],
        text = college_merge4['School Name'] + '<br>' + 'Mid-Career Median Salary: ' + college_merge4['Mid-Career Median Salary_x_1'],
        hoverinfo='text',
        mode = 'markers',
        marker = dict(
                    color = college_merge4['Mid-Career Median Salary_x'],
                    symbol = 'circle',
                    opacity = .5
                )
)

college_map_layout = go.Layout(
        title = 'Mid-Career Salary of Universities In U.S',
        mapbox=go.layout.Mapbox(
            accesstoken=mapbox_access_token,
            zoom=1
        )
    )

college_map = go.Figure(data=college_map_data, layout=college_map_layout)
college_map.show()

As the map shows, the points in lighter color represent a higher mid-career salary, and the darker points represent a lower level. The map confirm our conclusion that northeastern of America does have a higher salary level, and help graduates a clear guide when choose the working location to pursue a better development.

3. Regression Model and Hypothesis testing

Null Hypothesis

We are looking at the impacts of:

The salary of mid-career.

Null Hypothesis: None of starting salary, region and school type and student's SAT score have noticeable impacts on the mid-career salary.

To test the null hypothesis, we will perform a Multiple Linear Regression on our dataset by using scikit-learn.

We will use our tidied datasets for the features and the target.

Features: Mid-career salary Target: dependent variables.

Since region and school type are categorical variables, we will create dummy variable for both variabls. To achieve that, we used pd.get_dummies() in this case.

After setting up the data, we will set up X and Y for our use in scikit-learn LinearRegression() function. Then, fit the model.

In [40]:
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from sklearn import linear_model
from sklearn import model_selection
from statsmodels import api as sm

Ordinary Least Squares Regression

In [41]:
tem = ['Region','School Type','Starting Median Salary_x', 'Math','Evidence-Based Reading and Writing']
df_tem = college_merge_final[tem]
X = pd.get_dummies(df_tem,drop_first=True) 
X = sm.add_constant(X)
y = college_merge_final['Mid-Career Median Salary_x']
linear = linear_model.LinearRegression()
model = linear.fit(X,y)
linear.score(X,y)
Out[41]:
0.8833262691566798

R square is around 0.88, which means that 88% of the variation is explained by our model.

In [42]:
coef = linear.coef_.tolist()
tem = list(X.columns)
for f, c in zip(tem, coef):
    print("Feature: {}, Coefficient: {}".format(f,c))
Feature: const, Coefficient: 0.0
Feature: Starting Median Salary_x, Coefficient: 1.9768760568572907
Feature: Math, Coefficient: 4.219515193303181
Feature: Evidence-Based Reading and Writing, Coefficient: 5.239367500477243
Feature: Region_Midwestern, Coefficient: -3202.0420213391053
Feature: Region_Northeastern, Coefficient: 489.2315580390664
Feature: Region_Southern, Coefficient: -1116.0259497465772
Feature: Region_Western, Coefficient: -2356.8183148137105
Feature: School Type_Ivy League, Coefficient: 11175.460743149551
Feature: School Type_Liberal Arts, Coefficient: 10335.682247806099
Feature: School Type_Party, Coefficient: 6330.246162528429
Feature: School Type_State, Coefficient: 3898.26368622417

From the coefficients above, we can confirm our pervious estimation. We can see that, in average, studying in an Ivy-league school has the highest positive impact on your future salary. Also, students graduated from Liberal Arts tend to make more money than party and state school in average.

On the other hand, we are surprised to see that only northeastern has a postive coefficient. In average, students graduated from other regions tend to have negative impact on their career. We specilate that can be caused by most Lvy-league schools are located in Northeastern.

We want to observe more about this result by p-values, t-values, and std-err. Then, we need Ordinary Least Squares Regression model.

In [43]:
model = sm.OLS(endog=y, exog=X)
results = model.fit()
results.summary()
Out[43]:
OLS Regression Results
Dep. Variable: Mid-Career Median Salary_x R-squared: 0.883
Model: OLS Adj. R-squared: 0.877
Method: Least Squares F-statistic: 141.1
Date: Thu, 12 Dec 2019 Prob (F-statistic): 3.20e-89
Time: 15:00:58 Log-Likelihood: -2140.0
No. Observations: 217 AIC: 4304.
Df Residuals: 205 BIC: 4345.
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -1.612e+04 7895.291 -2.041 0.043 -3.17e+04 -549.781
Starting Median Salary_x 1.9769 0.083 23.737 0.000 1.813 2.141
Math 4.2195 6.168 0.684 0.495 -7.940 16.379
Evidence-Based Reading and Writing 5.2394 11.418 0.459 0.647 -17.273 27.752
Region_Midwestern -3202.0420 1437.196 -2.228 0.027 -6035.623 -368.461
Region_Northeastern 489.2316 1391.546 0.352 0.726 -2254.345 3232.808
Region_Southern -1116.0259 1354.024 -0.824 0.411 -3785.624 1553.572
Region_Western -2356.8183 1458.118 -1.616 0.108 -5231.649 518.013
School Type_Ivy League 1.118e+04 2193.557 5.095 0.000 6850.636 1.55e+04
School Type_Liberal Arts 1.034e+04 2107.528 4.904 0.000 6180.474 1.45e+04
School Type_Party 6330.2462 2007.049 3.154 0.002 2373.142 1.03e+04
School Type_State 3898.2637 1800.931 2.165 0.032 347.542 7448.985
Omnibus: 6.894 Durbin-Watson: 1.547
Prob(Omnibus): 0.032 Jarque-Bera (JB): 9.048
Skew: 0.213 Prob(JB): 0.0108
Kurtosis: 3.905 Cond. No. 1.16e+06

Given the assumption that the covariance matrix of the errors is correctly specified, we may say that the standard Errors are small enough for a good performance.

Looking at the 'coef' column above, we observe that obviously the School Type_Ivy League has significant impact on the mid-career salary. Also, the p-value of School Type_Ivy League is 0, perfectly below a critical value of 0.05, which confirms more about the findings.

In addition, we are more condifent that it is not true only student graduated from northeastern tend to have a positive impact in their career in average. Based on the observation of the p-values above ('P>|t|' column in the middle chart), we can see that the p-value of Region_Northeastern is way greater than a critical value of 5% (p-value of 0.05).

Also, the p-value of SAT verbal and math are all bigger than 5%. Therefore, we conclude that SAT score are not a significant factors for predicting mid-career salary. We should remove this two variables from the model.

All these mean that we should reject the null hypothesis because obviously the school type and region have significant impact on mid-career salary.

4. Prediction and Machine Learning

Here we will split up the dataset into training and testing data for both variables. The percentage of Test/train split depends on different programmer. For our data, we will use 20% of the data for testing purpose and 80% for training purposes. We will generate mutiple linear regression model using the training data first. Then, we will make the predictions for the testing data. After that, we will compare the predictions with the actual results (Y_test).

In [35]:
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)
In [36]:
# create and fit the linear model
linear = linear_model.LinearRegression()
model = linear.fit(X_train, y_train)

# generate predictions for player Rating to compare with y_test data
predictions = linear.predict(X_test)

# display first 5 results of our predicted mid-career salary.
predictions[0:5]
Out[36]:
array([82031.99775949, 68242.71513675, 81521.28120201, 79498.25914632,
       76051.6727892 ])
In [37]:
plt.figure(figsize=(12,12))
plt.title("Predicted vs. Actual Values for Mid-career Salary", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()
In [38]:
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted')
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
sns.distplot(predictions, hist=False, label="Linear Regression Predictions", ax=ax)
plt.show()
In [39]:
print("Accuracy (scale of 0 to 1): {}".format(model.score(X_test, y_test)))
Accuracy (scale of 0 to 1): 0.8811988390054758

From two plots above, we obviously observe that our model predicted the mid-career salary very well for the testing dataset. Our model also provides a score of the actual accuracy (1 is perfect), and our accuracy is around 0.88, which is a consideratable high accuracy score for the model.

Appendix

In [40]:
#draw a bar plot of the salary increase by school type
college_type_salary = college_type_salary.reindex(index = ['Starting Median Salary','Mid-Career 10th Percentile Salary','Mid-Career 25th Percentile Salary','Mid-Career Median Salary','Mid-Career 75th Percentile Salary','Mid-Career 90th Percentile Salary',])
xn = ['salary_increase']
college_type_rise = pd.DataFrame(columns = ['Engineer', 'Party', 'Liberal Arts', 'Ivy League','State'])
college_type_rise.loc[1] = [college_type_salary.iloc[3,i] -college_type_salary.iloc[0,i] for i in range(len(college_type_salary.columns))]
college_type_rise.plot(kind = 'bar')
plt.xticks([1],xn)
plt.ylabel('Salary', fontsize = 17)
plt.xlabel('Schoole types', fontsize = 17)
plt.title('salary increase by school type')
Out[40]:
Text(0.5, 1.0, 'salary increase by school type')
In [41]:
college_region_salary.plot(kind = 'bar', figsize = (20,10))
plt.ylabel('Salary', fontsize = 30)
plt.xticks(rotation = 'vertical', fontsize = 17)
plt.yticks(fontsize = 20)
plt.legend(fontsize = 20)
plt.show()